Reading and cleaning the data:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams["figure.figsize"] = (12,10)
pd.set_option('display.max_rows', None, 'display.max_columns', None)
elec = pd.read_excel('DatosEleccionesEspaña.xlsx')
elec.head()
| Name | CodigoProvincia | CCAA | Population | TotalCensus | AbstentionPtge | AbstencionAlta | Izda_Pct | Dcha_Pct | Otros_Pct | Izquierda | Derecha | Age_0-4_Ptge | Age_under19_Ptge | Age_19_65_pct | Age_over65_pct | WomanPopulationPtge | ForeignersPtge | SameComAutonPtge | SameComAutonDiffProvPtge | DifComAutonPtge | UnemployLess25_Ptge | Unemploy25_40_Ptge | UnemployMore40_Ptge | AgricultureUnemploymentPtge | IndustryUnemploymentPtge | ConstructionUnemploymentPtge | ServicesUnemploymentPtge | totalEmpresas | Industria | Construccion | ComercTTEHosteleria | Servicios | ActividadPpal | inmuebles | Pob2010 | SUPERFICIE | Densidad | PobChange_pct | PersonasInmueble | Explotaciones | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abadía | 10 | Extremadura | 336 | 282 | 20.213 | 0 | 60.444 | 35.555 | 1.778 | 1 | 0 | 3.869 | 18.155 | 55.059 | 26.785 | 44.048 | 0.89 | 79.762 | 0.298 | 19.345 | 2.381 | 54.762 | 42.857 | 4.762 | 9.524 | 11.905 | 73.810 | 15.0 | 0.0 | 0.0 | 0.0 | 0.0 | Otro | 216.0 | 326.0 | 4507.5593 | MuyBaja | 3.07 | 1.56 | 28 |
| 1 | Abertura | 10 | Extremadura | 429 | 364 | 25.275 | 0 | 54.779 | 44.118 | 0.368 | 1 | 0 | 1.632 | 13.055 | 56.643 | 30.304 | 50.117 | 1.63 | 90.909 | 2.797 | 7.226 | 16.216 | 32.432 | 51.351 | 8.108 | 8.108 | 10.811 | 67.568 | 11.0 | 0.0 | 0.0 | 0.0 | 0.0 | Otro | 382.0 | 459.0 | 6270.7646 | MuyBaja | -6.54 | 1.12 | 67 |
| 2 | Acebo | 10 | Extremadura | 569 | 569 | 27.241 | 0 | 44.203 | 53.140 | 0.966 | 0 | 1 | 1.230 | 9.139 | 54.834 | 36.028 | 49.033 | 0.70 | 78.910 | 0.703 | 18.102 | 8.197 | 36.066 | 55.738 | 22.951 | 9.836 | 13.115 | 49.180 | 49.0 | 0.0 | 0.0 | 0.0 | 0.0 | Otro | 918.0 | 674.0 | 5702.1000 | MuyBaja | -15.58 | 0.62 | 74 |
| 3 | Acehúche | 10 | Extremadura | 822 | 704 | 30.114 | 1 | 50.813 | 45.325 | 0.000 | 1 | 0 | 4.258 | 14.964 | 60.098 | 24.940 | 51.095 | 0.12 | 93.917 | 0.487 | 5.109 | 7.407 | 61.111 | 31.481 | 16.667 | 5.556 | 16.667 | 59.259 | 50.0 | 0.0 | 0.0 | 0.0 | 0.0 | Otro | 599.0 | 842.0 | 9106.4649 | MuyBaja | -2.38 | 1.37 | 66 |
| 4 | Aceituna | 10 | Extremadura | 623 | 540 | 30.185 | 1 | 44.562 | 49.867 | 0.796 | 0 | 1 | 3.531 | 15.569 | 59.391 | 25.042 | 48.154 | 0.64 | 93.258 | 0.161 | 4.173 | 15.385 | 48.077 | 36.538 | 21.154 | 0.000 | 11.538 | 61.538 | 22.0 | 0.0 | 0.0 | 0.0 | 0.0 | Otro | 394.0 | 625.0 | 4007.6141 | MuyBaja | -0.32 | 1.58 | 96 |
elec.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 8119 entries, 0 to 8118 Data columns (total 41 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Name 8119 non-null object 1 CodigoProvincia 8119 non-null int64 2 CCAA 8119 non-null object 3 Population 8119 non-null int64 4 TotalCensus 8119 non-null int64 5 AbstentionPtge 8119 non-null float64 6 AbstencionAlta 8119 non-null int64 7 Izda_Pct 8119 non-null float64 8 Dcha_Pct 8119 non-null float64 9 Otros_Pct 8119 non-null float64 10 Izquierda 8119 non-null int64 11 Derecha 8119 non-null int64 12 Age_0-4_Ptge 8119 non-null float64 13 Age_under19_Ptge 8119 non-null float64 14 Age_19_65_pct 8119 non-null float64 15 Age_over65_pct 8119 non-null float64 16 WomanPopulationPtge 8119 non-null float64 17 ForeignersPtge 8119 non-null float64 18 SameComAutonPtge 8119 non-null float64 19 SameComAutonDiffProvPtge 8119 non-null float64 20 DifComAutonPtge 8119 non-null float64 21 UnemployLess25_Ptge 8119 non-null float64 22 Unemploy25_40_Ptge 8119 non-null float64 23 UnemployMore40_Ptge 8119 non-null float64 24 AgricultureUnemploymentPtge 8119 non-null float64 25 IndustryUnemploymentPtge 8119 non-null float64 26 ConstructionUnemploymentPtge 8119 non-null float64 27 ServicesUnemploymentPtge 8119 non-null float64 28 totalEmpresas 8114 non-null float64 29 Industria 7931 non-null float64 30 Construccion 7980 non-null float64 31 ComercTTEHosteleria 8110 non-null float64 32 Servicios 8057 non-null float64 33 ActividadPpal 8119 non-null object 34 inmuebles 7981 non-null float64 35 Pob2010 8112 non-null float64 36 SUPERFICIE 8110 non-null float64 37 Densidad 8119 non-null object 38 PobChange_pct 8112 non-null float64 39 PersonasInmueble 7981 non-null float64 40 Explotaciones 8119 non-null int64 dtypes: float64(30), int64(7), object(4) memory usage: 2.5+ MB
elec.nunique()
Name 8102 CodigoProvincia 52 CCAA 19 Population 3597 TotalCensus 3310 AbstentionPtge 5675 AbstencionAlta 2 Izda_Pct 6569 Dcha_Pct 6682 Otros_Pct 4319 Izquierda 2 Derecha 2 Age_0-4_Ptge 3761 Age_under19_Ptge 5891 Age_19_65_pct 6215 Age_over65_pct 6778 WomanPopulationPtge 4524 ForeignersPtge 2329 SameComAutonPtge 6151 SameComAutonDiffProvPtge 4207 DifComAutonPtge 5574 UnemployLess25_Ptge 2342 Unemploy25_40_Ptge 2681 UnemployMore40_Ptge 2751 AgricultureUnemploymentPtge 2525 IndustryUnemploymentPtge 2538 ConstructionUnemploymentPtge 2505 ServicesUnemploymentPtge 2904 totalEmpresas 1225 Industria 307 Construccion 456 ComercTTEHosteleria 802 Servicios 757 ActividadPpal 5 inmuebles 3087 Pob2010 3624 SUPERFICIE 8109 Densidad 4 PobChange_pct 3048 PersonasInmueble 282 Explotaciones 758 dtype: int64
This dataset has many observations and many variables, and at first glance it needs some cleaning: some columns have missing values, in others the variable type is not correctly assigned, and we will see whether we also find outliers or out-of-range values.
Let's assign the correct type to the variables that should be categorical (10 or fewer distinct values), and also convert the province code to object, since it is not a number we want to manipulate arithmetically:
factors = list(elec.loc[:, elec.nunique()<= 10])
elec[factors] = elec[factors].astype('category')
elec['CodigoProvincia'] = elec['CodigoProvincia'].astype('object')
# Let's now look at a bit more information on the numeric variables, in order to pick 10 of them
elec.describe()
| Population | TotalCensus | AbstentionPtge | Izda_Pct | Dcha_Pct | Otros_Pct | Age_0-4_Ptge | Age_under19_Ptge | Age_19_65_pct | Age_over65_pct | WomanPopulationPtge | ForeignersPtge | SameComAutonPtge | SameComAutonDiffProvPtge | DifComAutonPtge | UnemployLess25_Ptge | Unemploy25_40_Ptge | UnemployMore40_Ptge | AgricultureUnemploymentPtge | IndustryUnemploymentPtge | ConstructionUnemploymentPtge | ServicesUnemploymentPtge | totalEmpresas | Industria | Construccion | ComercTTEHosteleria | Servicios | inmuebles | Pob2010 | SUPERFICIE | PobChange_pct | PersonasInmueble | Explotaciones | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8.119000e+03 | 8.119000e+03 | 8119.000000 | 8119.000000 | 8119.000000 | 8119.000000 | 8119.000000 | 8119.000000 | 8119.000000 | 8119.000000 | 8119.000000 | 8119.000000 | 8119.000000 | 8119.000000 | 8119.000000 | 8119.000000 | 8119.000000 | 8119.000000 | 8119.000000 | 8119.000000 | 8119.000000 | 8119.000000 | 8114.000000 | 7931.000000 | 7980.000000 | 8110.000000 | 8057.000000 | 7.981000e+03 | 8.112000e+03 | 8110.000000 | 8112.000000 | 7981.000000 | 8119.000000 |
| mean | 5.741855e+03 | 4.260666e+03 | 26.506951 | 34.403789 | 48.915409 | 14.666183 | 3.019429 | 13.567747 | 57.371541 | 29.073583 | 47.302755 | 5.619553 | 81.629141 | 4.336688 | 10.729018 | 7.322292 | 37.003976 | 50.180442 | 8.400982 | 10.007836 | 10.837496 | 58.649705 | 398.603032 | 23.419367 | 48.878321 | 146.735265 | 172.149684 | 3.246160e+03 | 5.795812e+03 | 6214.695257 | -4.897406 | 1.296009 | 2447.204582 |
| std | 4.621520e+04 | 3.442889e+04 | 7.540091 | 16.482285 | 19.945087 | 25.093642 | 2.053726 | 6.780648 | 6.818072 | 11.745849 | 4.361907 | 7.348553 | 12.289063 | 6.394440 | 8.847295 | 9.408555 | 20.317306 | 22.803515 | 12.958405 | 12.528441 | 13.281177 | 24.259562 | 4219.366083 | 158.610811 | 421.863266 | 1233.023418 | 2446.812300 | 2.431471e+04 | 4.753568e+04 | 9218.194603 | 10.383417 | 0.566620 | 15062.738051 |
| min | 5.000000e+00 | 5.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 23.459000 | 0.000000 | 11.765000 | -8.960000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000e+00 | 5.000000e+00 | 2.578400 | -52.270000 | 0.110000 | 1.000000 |
| 25% | 1.660000e+02 | 1.400000e+02 | 21.678000 | 21.892500 | 38.690500 | 0.759500 | 1.389000 | 8.334000 | 53.845000 | 19.824500 | 45.725000 | 1.060000 | 75.806000 | 0.676000 | 4.933000 | 0.000000 | 28.571000 | 41.667000 | 0.000000 | 0.000000 | 0.000000 | 50.000000 | 7.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.800000e+02 | 1.777500e+02 | 1839.191800 | -10.400000 | 0.850000 | 22.000000 |
| 50% | 5.490000e+02 | 4.470000e+02 | 26.429000 | 35.165000 | 51.582000 | 1.883000 | 2.978000 | 13.889000 | 58.655000 | 27.559000 | 48.485000 | 3.590000 | 84.493000 | 2.190000 | 8.271000 | 5.882000 | 39.935000 | 50.000000 | 3.493000 | 7.143000 | 8.333000 | 62.018000 | 30.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4.860000e+02 | 5.820000e+02 | 3487.737450 | -4.960000 | 1.250000 | 52.000000 |
| 75% | 2.427500e+03 | 1.846500e+03 | 31.475000 | 46.032000 | 62.201000 | 16.497000 | 4.533000 | 19.058500 | 61.818000 | 36.908000 | 50.000000 | 8.180000 | 90.462000 | 5.277000 | 13.898000 | 10.469500 | 46.667000 | 60.039000 | 11.732500 | 14.286000 | 14.286000 | 72.123000 | 147.000000 | 14.000000 | 25.000000 | 65.000000 | 40.000000 | 1.589000e+03 | 2.483000e+03 | 6893.877800 | 0.092500 | 1.730000 | 137.000000 |
| max | 3.141991e+06 | 2.363829e+06 | 57.576000 | 94.117000 | 100.000000 | 100.000000 | 13.245000 | 33.696000 | 100.002000 | 76.471000 | 72.683000 | 71.470000 | 127.156000 | 67.308000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 299397.000000 | 10521.000000 | 30343.000000 | 80856.000000 | 177677.000000 | 1.615548e+06 | 3.273049e+06 | 175022.910000 | 138.460000 | 3.330000 | 99999.000000 |
We will pick the following variables simply because they look interesting to analyze (and because they are numeric).
We create the new data frame with only these variables plus CCAA, which is our target variable:
elec1 = elec[["CCAA", "Population", "AbstentionPtge", "Izda_Pct", "Age_over65_pct", "WomanPopulationPtge",
              "UnemployLess25_Ptge", "UnemployMore40_Ptge", "SUPERFICIE", "totalEmpresas", "Pob2010"]].copy()
elec1.set_index(['CCAA'], inplace = True)
elec1.head()
| Population | AbstentionPtge | Izda_Pct | Age_over65_pct | WomanPopulationPtge | UnemployLess25_Ptge | UnemployMore40_Ptge | SUPERFICIE | totalEmpresas | Pob2010 | |
|---|---|---|---|---|---|---|---|---|---|---|
| CCAA | ||||||||||
| Extremadura | 336 | 20.213 | 60.444 | 26.785 | 44.048 | 2.381 | 42.857 | 4507.5593 | 15.0 | 326.0 |
| Extremadura | 429 | 25.275 | 54.779 | 30.304 | 50.117 | 16.216 | 51.351 | 6270.7646 | 11.0 | 459.0 |
| Extremadura | 569 | 27.241 | 44.203 | 36.028 | 49.033 | 8.197 | 55.738 | 5702.1000 | 49.0 | 674.0 |
| Extremadura | 822 | 30.114 | 50.813 | 24.940 | 51.095 | 7.407 | 31.481 | 9106.4649 | 50.0 | 842.0 |
| Extremadura | 623 | 30.185 | 44.562 | 25.042 | 48.154 | 15.385 | 36.538 | 4007.6141 | 22.0 | 625.0 |
elec1.info()
<class 'pandas.core.frame.DataFrame'> Index: 8119 entries, Extremadura to CastillaLeón Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Population 8119 non-null int64 1 AbstentionPtge 8119 non-null float64 2 Izda_Pct 8119 non-null float64 3 Age_over65_pct 8119 non-null float64 4 WomanPopulationPtge 8119 non-null float64 5 UnemployLess25_Ptge 8119 non-null float64 6 UnemployMore40_Ptge 8119 non-null float64 7 SUPERFICIE 8110 non-null float64 8 totalEmpresas 8114 non-null float64 9 Pob2010 8112 non-null float64 dtypes: float64(9), int64(1) memory usage: 697.7+ KB
elec1.describe()
| Population | AbstentionPtge | Izda_Pct | Age_over65_pct | WomanPopulationPtge | UnemployLess25_Ptge | UnemployMore40_Ptge | SUPERFICIE | totalEmpresas | Pob2010 | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 8.119000e+03 | 8119.000000 | 8119.000000 | 8119.000000 | 8119.000000 | 8119.000000 | 8119.000000 | 8110.000000 | 8114.000000 | 8.112000e+03 |
| mean | 5.741855e+03 | 26.506951 | 34.403789 | 29.073583 | 47.302755 | 7.322292 | 50.180442 | 6214.695257 | 398.603032 | 5.795812e+03 |
| std | 4.621520e+04 | 7.540091 | 16.482285 | 11.745849 | 4.361907 | 9.408555 | 22.803515 | 9218.194603 | 4219.366083 | 4.753568e+04 |
| min | 5.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 11.765000 | 0.000000 | 0.000000 | 2.578400 | 0.000000 | 5.000000e+00 |
| 25% | 1.660000e+02 | 21.678000 | 21.892500 | 19.824500 | 45.725000 | 0.000000 | 41.667000 | 1839.191800 | 7.000000 | 1.777500e+02 |
| 50% | 5.490000e+02 | 26.429000 | 35.165000 | 27.559000 | 48.485000 | 5.882000 | 50.000000 | 3487.737450 | 30.000000 | 5.820000e+02 |
| 75% | 2.427500e+03 | 31.475000 | 46.032000 | 36.908000 | 50.000000 | 10.469500 | 60.039000 | 6893.877800 | 147.000000 | 2.483000e+03 |
| max | 3.141991e+06 | 57.576000 | 94.117000 | 76.471000 | 72.683000 | 100.000000 | 100.000000 | 175022.910000 | 299397.000000 | 3.273049e+06 |
We need to clean the data a bit, since there are some missing values (in totalEmpresas, Pob2010 and SUPERFICIE). At first glance there also seem to be outliers in the business, population and surface-area figures; however, since these are population-level data, where some areas are known to differ widely from others in population or area, we will work with those values as they are. As for the percentages, where out-of-range values (greater than 100 or less than 0) would be easy to spot, everything looks fine, so the missing values appear to be the only problem.
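The out-of-range check described above can be sketched as follows (a minimal example on a synthetic mini-frame; the column names only mimic the real ones, this is not the election data):

```python
import pandas as pd

# Hypothetical stand-in for the percentage columns of elec1
df = pd.DataFrame({
    "AbstentionPtge": [20.2, 25.3, 57.6],
    "Izda_Pct": [60.4, 101.0, 44.2],   # one deliberately invalid value
})

# Count values outside the valid [0, 100] range, column by column
pct_cols = df.filter(regex="Ptge|Pct|pct").columns
out_of_range = ((df[pct_cols] < 0) | (df[pct_cols] > 100)).sum()
print(out_of_range)
```

On the real frame the same two lines applied to `elec1` would flag any percentage column needing attention.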
We know the incidence per variable is minimal, since in the worst case we have 9 missing out of 8119 rows, so dropping them would be unproblematic. Still, let's first check the incidence per observation and overall, to make sure that overlapping missing values would not make us drop a significant share of the data:
# Share of missing values per observation (missing count relative to the non-missing count, in %)
elec1['prop_missings'] = elec1.apply(lambda x: x.isna().sum()/x.count()*100, axis=1)
# Number and percentage of observations with at least one missing value:
qty = len(elec1[elec1['prop_missings']>0])
ptg_miss = round(len(elec1[elec1['prop_missings']>0])/len(elec1['prop_missings'])*100, 2)
print('Observations with missing values: ', qty, '\nPercentage: ', ptg_miss)
Observations with missing values:  12 Percentage:  0.15
That is only 12 observations out of 8119, i.e. 0.15%, which is negligible, so we could drop them. Before doing so, let's just confirm they do not all belong to the same autonomous community, to be sure we are not hitting any single one too hard:
elec1[elec1['prop_missings']>0]
| Population | AbstentionPtge | Izda_Pct | Age_over65_pct | WomanPopulationPtge | UnemployLess25_Ptge | UnemployMore40_Ptge | SUPERFICIE | totalEmpresas | Pob2010 | prop_missings | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CCAA | |||||||||||
| Extremadura | 940 | 31.946 | 44.970 | 17.554 | 47.553 | 10.714 | 32.143 | 47959.3300 | NaN | NaN | 25.000000 |
| Galicia | 1233 | 40.716 | 27.345 | 38.119 | 51.419 | 5.556 | 57.407 | NaN | 78.0 | 1361.0 | 11.111111 |
| Galicia | 6691 | 36.351 | 17.908 | 33.881 | 52.130 | 4.396 | 57.418 | NaN | 363.0 | 7313.0 | 11.111111 |
| Galicia | 5253 | 30.209 | 21.003 | 33.905 | 49.476 | 0.000 | 0.000 | NaN | NaN | NaN | 42.857143 |
| Andalucía | 783 | 19.741 | 76.411 | 20.945 | 49.042 | 20.270 | 39.189 | NaN | NaN | NaN | 42.857143 |
| Andalucía | 2111 | 24.924 | 68.740 | 17.527 | 50.734 | 12.500 | 40.972 | NaN | NaN | NaN | 42.857143 |
| Andalucía | 3341 | 38.566 | 62.968 | 19.185 | 51.182 | 15.734 | 43.706 | 23561.5152 | 138.0 | NaN | 11.111111 |
| Galicia | 3368 | 26.168 | 25.262 | 22.208 | 50.980 | 4.153 | 55.272 | NaN | 170.0 | 3469.0 | 11.111111 |
| Galicia | 1828 | 36.409 | 22.640 | 42.997 | 53.282 | 2.857 | 55.714 | NaN | 71.0 | 2299.0 | 11.111111 |
| Cataluña | 5839 | 36.145 | 28.822 | 14.010 | 50.488 | 9.763 | 46.450 | 5430.0184 | 259.0 | NaN | 11.111111 |
| Melilla | 85584 | 48.650 | 34.417 | 9.684 | 49.064 | 16.363 | 37.472 | NaN | 4349.0 | 76034.0 | 11.111111 |
| Extremadura | 2527 | 31.865 | 50.908 | 17.967 | 50.297 | 8.040 | 53.769 | NaN | NaN | NaN | 42.857143 |
elec1.loc['Melilla']
Population 85584.000000 AbstentionPtge 48.650000 Izda_Pct 34.417000 Age_over65_pct 9.684000 WomanPopulationPtge 49.064000 UnemployLess25_Ptge 16.363000 UnemployMore40_Ptge 37.472000 SUPERFICIE NaN totalEmpresas 4349.000000 Pob2010 76034.000000 prop_missings 11.111111 Name: Melilla, dtype: float64
Since Melilla has a missing value for SUPERFICIE and there is only one observation for this autonomous community, we will drop every observation with missing values except Melilla's: the impact on the other communities is minimal (these are rows with very little population), while for Melilla we will impute using the nearest "neighbours", because dropping it would mean losing this community entirely.
elec2 = elec1[(elec1['prop_missings']==0) | (elec1.index =='Melilla')].drop('prop_missings', axis = 1)
elec2.info()
<class 'pandas.core.frame.DataFrame'> Index: 8108 entries, Extremadura to CastillaLeón Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Population 8108 non-null int64 1 AbstentionPtge 8108 non-null float64 2 Izda_Pct 8108 non-null float64 3 Age_over65_pct 8108 non-null float64 4 WomanPopulationPtge 8108 non-null float64 5 UnemployLess25_Ptge 8108 non-null float64 6 UnemployMore40_Ptge 8108 non-null float64 7 SUPERFICIE 8107 non-null float64 8 totalEmpresas 8108 non-null float64 9 Pob2010 8108 non-null float64 dtypes: float64(9), int64(1) memory usage: 696.8+ KB
import sklearn.impute as skl_imp
imputer_knn = skl_imp.KNNImputer(n_neighbors=3)
elec_imp = pd.DataFrame(imputer_knn.fit_transform(elec2),columns=elec2.columns, index = elec2.index)
elec_imp[['Population','SUPERFICIE']].groupby('CCAA').sum()
| Population | SUPERFICIE | |
|---|---|---|
| CCAA | ||
| Andalucía | 8387381.0 | 8.669027e+06 |
| Aragón | 1317847.0 | 4.763550e+06 |
| Asturias | 1051229.0 | 1.027269e+06 |
| Baleares | 1104479.0 | 4.991700e+05 |
| Canarias | 2100306.0 | 7.383646e+05 |
| Cantabria | 585179.0 | 5.246334e+05 |
| CastillaLeón | 2472052.0 | 9.410554e+06 |
| CastillaMancha | 2059191.0 | 7.932435e+06 |
| Cataluña | 7502267.0 | 3.197172e+06 |
| Ceuta | 84263.0 | 1.341337e+03 |
| ComValenciana | 4980689.0 | 2.352726e+06 |
| Extremadura | 1088694.0 | 4.155715e+06 |
| Galicia | 2713974.0 | 2.917790e+06 |
| Madrid | 6436996.0 | 8.064702e+05 |
| Melilla | 85584.0 | 4.024701e+03 |
| Murcia | 1467288.0 | 1.108311e+06 |
| Navarra | 640476.0 | 9.881824e+05 |
| PaísVasco | 2189257.0 | 7.109612e+05 |
| Rioja | 317053.0 | 5.205546e+05 |
Melilla ends up with a surface area fairly similar to Ceuta's. It will not be the true value, but at least it is in the right neighbourhood, which is what matters for the clustering that follows.
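As a sanity check on how `KNNImputer` behaves, here is a minimal toy example (synthetic data, unrelated to the election set): the missing value is filled with the mean of that feature over the k nearest rows, where distances are computed on the observed features only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame: row "c" is close to "a" and "b" in x, and far from "d"
toy = pd.DataFrame(
    {"x": [1.0, 1.1, 1.05, 9.0], "y": [10.0, 11.0, np.nan, 50.0]},
    index=["a", "b", "c", "d"],
)
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)
print(filled.loc["c", "y"])  # mean of the two nearest neighbours' y: (10 + 11) / 2 = 10.5
```

This is exactly why Melilla's imputed SUPERFICIE lands near Ceuta's: Ceuta is its closest row on the remaining features.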
We now have the cleaned dataset and the selected variables. We can start grouping.
elec_m = elec_imp.groupby('CCAA').agg({
    'Population': 'sum', 'SUPERFICIE': 'sum', 'totalEmpresas': 'sum', 'Pob2010': 'sum',
    'AbstentionPtge': 'mean', 'Izda_Pct': 'mean', 'Age_over65_pct': 'mean', 'WomanPopulationPtge': 'mean',
    'UnemployLess25_Ptge': 'mean', 'UnemployMore40_Ptge': 'mean'})
elec_m
| Population | SUPERFICIE | totalEmpresas | Pob2010 | AbstentionPtge | Izda_Pct | Age_over65_pct | WomanPopulationPtge | UnemployLess25_Ptge | UnemployMore40_Ptge | |
|---|---|---|---|---|---|---|---|---|---|---|
| CCAA | ||||||||||
| Andalucía | 8387381.0 | 8.669027e+06 | 489098.0 | 8370975.0 | 28.705762 | 55.122694 | 20.945208 | 49.362575 | 13.890823 | 42.875152 |
| Aragón | 1317847.0 | 4.763550e+06 | 90048.0 | 1347095.0 | 25.033557 | 41.598175 | 33.307423 | 45.753094 | 6.444003 | 46.946937 |
| Asturias | 1051229.0 | 1.027269e+06 | 67674.0 | 1084341.0 | 33.762987 | 49.701974 | 29.810372 | 49.841897 | 7.168179 | 49.391141 |
| Baleares | 1104479.0 | 4.991700e+05 | 89341.0 | 1106049.0 | 33.574701 | 44.388761 | 17.805015 | 49.460373 | 7.454119 | 51.069343 |
| Canarias | 2100306.0 | 7.383646e+05 | 135909.0 | 2118519.0 | 34.843398 | 39.926080 | 17.617239 | 49.451693 | 5.438432 | 52.023920 |
| Cantabria | 585179.0 | 5.246334e+05 | 37692.0 | 592250.0 | 26.880235 | 38.197441 | 24.182500 | 47.716373 | 7.898245 | 49.900794 |
| CastillaLeón | 2472052.0 | 9.410554e+06 | 160390.0 | 2559515.0 | 23.822925 | 31.514937 | 36.713069 | 45.622654 | 6.583358 | 51.557764 |
| CastillaMancha | 2059191.0 | 7.932435e+06 | 126143.0 | 2098373.0 | 22.698995 | 42.159800 | 32.234108 | 46.331884 | 7.629214 | 48.624712 |
| Cataluña | 7502267.0 | 3.197172e+06 | 595913.0 | 7512381.0 | 34.284757 | 9.688943 | 21.324291 | 48.407013 | 4.904039 | 55.936445 |
| Ceuta | 84263.0 | 1.341337e+03 | 3762.0 | 80579.0 | 47.411000 | 33.027000 | 11.028000 | 49.258000 | 14.928000 | 35.239000 |
| ComValenciana | 4980689.0 | 2.352726e+06 | 344518.0 | 5111706.0 | 21.888273 | 22.619101 | 24.188871 | 48.963561 | 6.256081 | 51.799408 |
| Extremadura | 1088694.0 | 4.155715e+06 | 65281.0 | 1107220.0 | 26.513229 | 51.463457 | 28.337870 | 49.125717 | 11.706470 | 48.945161 |
| Galicia | 2713974.0 | 2.917790e+06 | 196552.0 | 2777804.0 | 30.820100 | 22.620951 | 32.326472 | 50.618087 | 4.741159 | 56.704563 |
| Madrid | 6436996.0 | 8.064702e+05 | 516402.0 | 6458684.0 | 25.072363 | 38.733570 | 15.980966 | 48.564251 | 6.144123 | 51.347034 |
| Melilla | 85584.0 | 4.024701e+03 | 4349.0 | 76034.0 | 48.650000 | 34.417000 | 9.684000 | 49.064000 | 16.363000 | 37.472000 |
| Murcia | 1467288.0 | 1.108311e+06 | 92008.0 | 1461979.0 | 27.383689 | 37.337111 | 16.525956 | 49.159578 | 9.334244 | 45.551578 |
| Navarra | 640476.0 | 9.881824e+05 | 43866.0 | 636924.0 | 30.024482 | 40.204996 | 25.958614 | 46.936813 | 8.472298 | 47.689963 |
| PaísVasco | 2189257.0 | 7.109612e+05 | 151216.0 | 2178339.0 | 31.964825 | 31.297908 | 20.066888 | 48.836092 | 5.700032 | 46.502295 |
| Rioja | 317053.0 | 5.205546e+05 | 23024.0 | 322415.0 | 19.049741 | 37.199534 | 32.763080 | 43.487810 | 2.916609 | 50.325477 |
We do need to scale the data, since the columns are on very different scales.
As for the distance metric, we will use Euclidean distance, since we are working exclusively with continuous variables.
from sklearn.preprocessing import scale
s_elec = scale(elec_m)
pd.DataFrame(s_elec)
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.424811 | 2.053424 | 1.812292 | 2.402924 | -0.188271 | 1.745616 | -0.361957 | 0.660484 | 1.634669 | -1.058098 |
| 1 | -0.463244 | 0.721302 | -0.455273 | -0.459073 | -0.675233 | 0.449608 | 1.246967 | -1.405240 | -0.468850 | -0.280527 |
| 2 | -0.572163 | -0.553108 | -0.582411 | -0.566136 | 0.482354 | 1.226167 | 0.791831 | 0.934803 | -0.264290 | 0.186232 |
| 3 | -0.550409 | -0.733237 | -0.459290 | -0.557291 | 0.457386 | 0.717020 | -0.770648 | 0.716454 | -0.183520 | 0.506710 |
| 4 | -0.143593 | -0.651650 | -0.194672 | -0.144743 | 0.625625 | 0.289376 | -0.795087 | 0.711487 | -0.752895 | 0.689002 |
| 5 | -0.762554 | -0.724552 | -0.752781 | -0.766647 | -0.430350 | 0.123727 | 0.059372 | -0.281646 | -0.058067 | 0.283558 |
| 6 | 0.008273 | 2.306352 | -0.055561 | 0.034948 | -0.835772 | -0.516634 | 1.690206 | -1.479892 | -0.429486 | 0.599982 |
| 7 | -0.160389 | 1.802179 | -0.250166 | -0.152952 | -0.984813 | 0.503426 | 1.107276 | -1.073996 | -0.134061 | 0.039870 |
| 8 | 2.063223 | 0.187026 | 2.419259 | 2.053076 | 0.551545 | -2.608144 | -0.312619 | 0.113611 | -0.903846 | 1.436159 |
| 9 | -0.967189 | -0.903042 | -0.945585 | -0.975136 | 2.292182 | -0.371739 | -1.652666 | 0.600635 | 1.927642 | -2.516340 |
| 10 | 1.033105 | -0.101007 | 0.990730 | 1.074881 | -1.092321 | -1.369092 | 0.060201 | 0.432126 | -0.521932 | 0.646128 |
| 11 | -0.556858 | 0.513976 | -0.596009 | -0.556814 | -0.479018 | 1.394964 | 0.600187 | 0.524929 | 1.017650 | 0.101065 |
| 12 | 0.107104 | 0.091731 | 0.149926 | 0.123893 | 0.092106 | -1.368914 | 1.119298 | 1.379020 | -0.949855 | 1.582843 |
| 13 | 1.628037 | -0.628420 | 1.967445 | 1.623729 | -0.670087 | 0.175102 | -1.008045 | 0.203599 | -0.553557 | 0.559740 |
| 14 | -0.966650 | -0.902127 | -0.942250 | -0.976988 | 2.456483 | -0.238540 | -1.827585 | 0.489608 | 2.332989 | -2.089914 |
| 15 | -0.402194 | -0.525465 | -0.444135 | -0.412262 | -0.363588 | 0.041284 | -0.937115 | 0.544308 | 0.347563 | -0.546993 |
| 16 | -0.739964 | -0.566440 | -0.717698 | -0.748444 | -0.013399 | 0.316104 | 0.290531 | -0.727792 | 0.104087 | -0.138635 |
| 17 | -0.107254 | -0.660997 | -0.107691 | -0.120369 | 0.243904 | -0.537432 | -0.476268 | 0.359175 | -0.679001 | -0.365439 |
| 18 | -0.872090 | -0.725943 | -0.836131 | -0.876596 | -1.468731 | 0.028101 | 1.176121 | -2.701673 | -1.465239 | 0.364658 |
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram
from time import time
def plot_dendogram(model, **kwargs):
    # Build the linkage matrix that scipy expects from a fitted sklearn model
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for idx in merge:
            if idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[idx - n_samples]
        counts[i] = current_count
    linkage_matrix = np.column_stack([model.children_, model.distances_,
                                      counts]).astype(float)
    # Plot
    dendrogram(linkage_matrix, **kwargs)
    plt.show()
for linkage in ("ward", "average", "complete", "single"):
    clustering = AgglomerativeClustering(linkage=linkage, n_clusters=None,
                                         distance_threshold=0)
    t0 = time()
    clustering.fit(s_elec)
    print("%s :\t%.2fs" % (linkage, time() - t0))
    plt.clf()
    plot_dendogram(clustering)
ward : 0.00s
average : 0.00s
complete : 0.00s
single : 0.00s
We will choose the minimum-variance (Ward) linkage, which gives very clear results for 4 groups.
Now we evaluate the model:
modelo_hclust_ward = AgglomerativeClustering(
affinity = 'euclidean',
linkage = 'ward',
n_clusters = 4
)
cluster_labels_w = modelo_hclust_ward.fit_predict(s_elec)
# Silhouette for the Ward clustering
from sklearn.metrics import silhouette_score
silhouette_score(s_elec, cluster_labels_w)
0.36896276000054923
# Calinski-Harabasz index
from sklearn import metrics
metrics.calinski_harabasz_score(s_elec, cluster_labels_w)
9.37060337359143
# Let's look at the centroids:
from sklearn.neighbors import NearestCentroid
clf = NearestCentroid()
clf.fit(s_elec, cluster_labels_w)
print(clf.centroids_)
[[ 1.45125569 0.3205507 1.46793048 1.45570065 -0.26140586 -0.68508653 -0.10062442 0.55776809 -0.25890451 0.63335438] [-0.37186235 1.0259725 -0.3992826 -0.36341829 -0.99113744 0.11612508 1.30514275 -1.66520014 -0.62440882 0.18099554] [-0.96691934 -0.90258449 -0.94391732 -0.97606197 2.37433231 -0.30513918 -1.74012564 0.54512157 2.13031548 -2.30312703] [-0.4793738 -0.48768431 -0.48183592 -0.48408827 0.0653643 0.44640134 -0.1546497 0.34771462 -0.05855915 0.0894375 ]]
The silhouette value is not as high as we would like, but at least it is positive. The Calinski-Harabasz index does not tell us much on its own yet, but we will use it to compare against the next model.
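For reference on the silhouette scale, here is a small synthetic sketch (unrelated to our data): two tight, well-separated groups score close to 1, so a value around 0.37 points to looser, partially overlapping clusters.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two compact, well-separated blobs
X = np.array([[0.0, 0.0], [0.0, 0.1], [10.0, 10.0], [10.0, 10.1]])
labels = [0, 0, 1, 1]
print(silhouette_score(X, labels))  # close to 1 for a clean separation
```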
Earlier we saw that 4 clusters would be the recommendation according to the hierarchical Ward method, but now let's use the 'scree_plot' function seen in class to identify an optimal number of clusters again:
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans
def scree_plot_kmeans(data, n_max):
    range_n_clusters = range(2, n_max)
    X_scaled = scale(data)
    inertias = []
    silhouette = []
    var_perc = []
    for n_clusters in range_n_clusters:
        modelo_kmeans = KMeans(
            n_clusters=n_clusters,
            n_init=20,
            random_state=123
        )
        cluster_labels = modelo_kmeans.fit_predict(X_scaled)
        inertias.append(modelo_kmeans.inertia_)
        silhouette.append(silhouette_score(X_scaled, cluster_labels))
        # Total sum of squares; between-cluster SS = TSS - within-cluster inertia
        tss = sum(pdist(X_scaled)**2) / X_scaled.shape[0]
        bss = tss - modelo_kmeans.inertia_
        var_perc.append(bss / tss * 100)
    fig, ax = plt.subplots(1, 3, figsize=(16, 6))
    ax[0].plot(range_n_clusters, inertias, marker='o')
    ax[0].set_title("Scree plot: within-cluster variance")
    ax[0].set_xlabel('Number of clusters')
    ax[0].set_ylabel('Intra-cluster (inertia)')
    ax[1].plot(range_n_clusters, silhouette, marker='o')
    ax[1].set_title("Scree plot: silhouette")
    ax[1].set_xlabel('Number of clusters')
    ax[1].set_ylabel('Mean silhouette score')
    ax[2].plot(range_n_clusters, var_perc, marker='o')
    ax[2].set_title("Scree plot: % variance")
    ax[2].set_xlabel('Number of clusters')
    ax[2].set_ylabel('% of variance explained')
scree_plot_kmeans(elec_m,10)
plt.show()
In this case the optimal number of clusters appears to be 5, which explains almost 80% of the variance.
We now build the K-means model with 5 clusters:
modelo_kmeans = KMeans(n_clusters=5, n_init=25, random_state=123)
modelo_kmeans.fit(X=s_elec)
KMeans(n_clusters=5, n_init=25, random_state=123)
print('Within-cluster variance: ' + str(modelo_kmeans.inertia_))
print('\nCentroids')
print(modelo_kmeans.cluster_centers_)
print('\nLabels')
print(modelo_kmeans.labels_[:5])
Within-cluster variance: 44.35015944227528 Centroids [[-0.96691934 -0.90258449 -0.94391732 -0.97606197 2.37433231 -0.30513918 -1.74012564 0.54512157 2.13031548 -2.30312703] [ 1.20786697 -0.11266759 1.38184 1.21889477 -0.27968948 -1.29276209 -0.0352914 0.53208909 -0.73229776 1.05621747] [-0.4793738 -0.48768431 -0.48183592 -0.48408827 0.0653643 0.44640134 -0.1546497 0.34771462 -0.05855915 0.0894375 ] [-0.37186235 1.0259725 -0.3992826 -0.36341829 -0.99113744 0.11612508 1.30514275 -1.66520014 -0.62440882 0.18099554] [ 2.42481057 2.05342382 1.8122924 2.4029242 -0.18827138 1.74561569 -0.36195653 0.66048413 1.63466851 -1.058098 ]] Labels [4 3 2 2 2]
We evaluate the model:
cluster_labels_k = modelo_kmeans.labels_
print('Silhouette:', silhouette_score(s_elec, cluster_labels_k))
print('Calinski-Harabasz index:', metrics.calinski_harabasz_score(s_elec, cluster_labels_k))
Silhouette: 0.3847878352621361 Calinski-Harabasz index: 11.494309115518341
The results are very similar to those obtained with the Ward method, but slightly better here. Let's see how different the clusters are between the two methods and which CCAA fall into each group:
elec_m['cluster_ward'] = cluster_labels_w
elec_m['cluster_kmeans'] = cluster_labels_k
elec_m[elec_m.cluster_ward != elec_m.cluster_kmeans][['cluster_ward', 'cluster_kmeans']]
| cluster_ward | cluster_kmeans | |
|---|---|---|
| CCAA | ||
| Andalucía | 0 | 4 |
| Aragón | 1 | 3 |
| Asturias | 3 | 2 |
| Baleares | 3 | 2 |
| Canarias | 3 | 2 |
| Cantabria | 3 | 2 |
| CastillaLeón | 1 | 3 |
| CastillaMancha | 1 | 3 |
| Cataluña | 0 | 1 |
| Ceuta | 2 | 0 |
| ComValenciana | 0 | 1 |
| Extremadura | 3 | 2 |
| Galicia | 0 | 1 |
| Madrid | 0 | 1 |
| Melilla | 2 | 0 |
| Murcia | 3 | 2 |
| Navarra | 3 | 2 |
| PaísVasco | 3 | 2 |
| Rioja | 1 | 3 |
Although the cluster numbering differs between the two methods, a closer look shows that the groupings are practically identical; the only difference is that k-means placed Andalucía in a group of its own (the fifth group, which does not exist under Ward's method). From this point on we would likely get very similar results with either method, especially in terms of visualization and centroids; we will keep the k-means model, which is slightly better in silhouette and Calinski-Harabasz index.
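The agreement between the two partitions, ignoring the arbitrary cluster numbering, can be quantified with the adjusted Rand index. A short sketch, using the labels transcribed from the table above (one entry per CCAA, in the order shown):

```python
# Sketch: compare the Ward and k-means partitions up to relabeling.
# Label vectors transcribed from the comparison table above.
from sklearn.metrics import adjusted_rand_score

ward   = [0, 1, 3, 3, 3, 3, 1, 1, 0, 2, 0, 3, 0, 0, 2, 3, 3, 3, 1]
kmeans = [4, 3, 2, 2, 2, 2, 3, 3, 1, 0, 1, 2, 1, 1, 0, 2, 2, 2, 3]

ari = adjusted_rand_score(ward, kmeans)
print(round(ari, 3))  # close to 1: the partitions agree up to relabeling
```

An ARI of 1 would mean identical partitions; here the only disagreement is Andalucía being split into its own cluster.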
Let's sort the autonomous communities by the cluster they belong to, to see whether we can identify their similarities:
elec_m.sort_values(by = 'cluster_kmeans').drop('cluster_ward', axis = 1)
| CCAA | Population | SUPERFICIE | totalEmpresas | Pob2010 | AbstentionPtge | Izda_Pct | Age_over65_pct | WomanPopulationPtge | UnemployLess25_Ptge | UnemployMore40_Ptge | cluster_kmeans |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Ceuta | 84263.0 | 1.341337e+03 | 3762.0 | 80579.0 | 47.411000 | 33.027000 | 11.028000 | 49.258000 | 14.928000 | 35.239000 | 0 |
| Melilla | 85584.0 | 4.024701e+03 | 4349.0 | 76034.0 | 48.650000 | 34.417000 | 9.684000 | 49.064000 | 16.363000 | 37.472000 | 0 |
| Cataluña | 7502267.0 | 3.197172e+06 | 595913.0 | 7512381.0 | 34.284757 | 9.688943 | 21.324291 | 48.407013 | 4.904039 | 55.936445 | 1 |
| ComValenciana | 4980689.0 | 2.352726e+06 | 344518.0 | 5111706.0 | 21.888273 | 22.619101 | 24.188871 | 48.963561 | 6.256081 | 51.799408 | 1 |
| Galicia | 2713974.0 | 2.917790e+06 | 196552.0 | 2777804.0 | 30.820100 | 22.620951 | 32.326472 | 50.618087 | 4.741159 | 56.704563 | 1 |
| Madrid | 6436996.0 | 8.064702e+05 | 516402.0 | 6458684.0 | 25.072363 | 38.733570 | 15.980966 | 48.564251 | 6.144123 | 51.347034 | 1 |
| Navarra | 640476.0 | 9.881824e+05 | 43866.0 | 636924.0 | 30.024482 | 40.204996 | 25.958614 | 46.936813 | 8.472298 | 47.689963 | 2 |
| Asturias | 1051229.0 | 1.027269e+06 | 67674.0 | 1084341.0 | 33.762987 | 49.701974 | 29.810372 | 49.841897 | 7.168179 | 49.391141 | 2 |
| Baleares | 1104479.0 | 4.991700e+05 | 89341.0 | 1106049.0 | 33.574701 | 44.388761 | 17.805015 | 49.460373 | 7.454119 | 51.069343 | 2 |
| Canarias | 2100306.0 | 7.383646e+05 | 135909.0 | 2118519.0 | 34.843398 | 39.926080 | 17.617239 | 49.451693 | 5.438432 | 52.023920 | 2 |
| Cantabria | 585179.0 | 5.246334e+05 | 37692.0 | 592250.0 | 26.880235 | 38.197441 | 24.182500 | 47.716373 | 7.898245 | 49.900794 | 2 |
| Murcia | 1467288.0 | 1.108311e+06 | 92008.0 | 1461979.0 | 27.383689 | 37.337111 | 16.525956 | 49.159578 | 9.334244 | 45.551578 | 2 |
| PaísVasco | 2189257.0 | 7.109612e+05 | 151216.0 | 2178339.0 | 31.964825 | 31.297908 | 20.066888 | 48.836092 | 5.700032 | 46.502295 | 2 |
| Extremadura | 1088694.0 | 4.155715e+06 | 65281.0 | 1107220.0 | 26.513229 | 51.463457 | 28.337870 | 49.125717 | 11.706470 | 48.945161 | 2 |
| Rioja | 317053.0 | 5.205546e+05 | 23024.0 | 322415.0 | 19.049741 | 37.199534 | 32.763080 | 43.487810 | 2.916609 | 50.325477 | 3 |
| CastillaLeón | 2472052.0 | 9.410554e+06 | 160390.0 | 2559515.0 | 23.822925 | 31.514937 | 36.713069 | 45.622654 | 6.583358 | 51.557764 | 3 |
| Aragón | 1317847.0 | 4.763550e+06 | 90048.0 | 1347095.0 | 25.033557 | 41.598175 | 33.307423 | 45.753094 | 6.444003 | 46.946937 | 3 |
| CastillaMancha | 2059191.0 | 7.932435e+06 | 126143.0 | 2098373.0 | 22.698995 | 42.159800 | 32.234108 | 46.331884 | 7.629214 | 48.624712 | 3 |
| Andalucía | 8387381.0 | 8.669027e+06 | 489098.0 | 8370975.0 | 28.705762 | 55.122694 | 20.945208 | 49.362575 | 13.890823 | 42.875152 | 4 |
Based on the table sorted by cluster, we can sense that population, or even surface area, might have somewhat more influence on the grouping; but with so many variables it would be difficult to find 2 or 3 that can be plotted and that explain the grouping obtained. Here a principal component analysis would serve us better, giving a clearer visualization of the groups in terms of those components.
from sklearn import decomposition as dc
from pca import pca
# Initialize pca with normalization and 2 components
pcaModel = pca(normalize=True, n_components=2)
# Note: the cluster_kmeans column remains among the 11 input variables
results = pcaModel.fit_transform(elec_m.drop('cluster_ward', axis=1))
[pca] >Processing dataframe..
[pca] >Normalizing input data per feature (zero mean and unit variance)..
[pca] >The PCA reduction is performed on the [11] columns of the input dataframe.
[pca] >Fit using PCA.
[pca] >Compute loadings and PCs.
[pca] >Compute explained variance.
[pca] >Outlier detection using Hotelling T2 test with alpha=[0.05] and n_components=[2]
[pca] >Outlier detection using SPE/DmodX with n_std=[2]
pcaModel.plot()
[Figure: cumulative explained variance; 2 principal components explain 65.94% of the variance]
With 2 components we explain only 65% of the variance; going up to 3 would take us close to 80%, but for the sake of visualizing the clusters more easily we will work with these two components.
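As a sketch of how the cumulative explained variance can be checked directly, here is the equivalent computation with scikit-learn's `PCA`; the random matrix is an assumption standing in for the standardized CCAA data, so the exact percentages will differ:

```python
# Sketch: cumulative explained variance with scikit-learn.
# `X` is a random stand-in (an assumption) for the standardized CCAA matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(123)
X = StandardScaler().fit_transform(rng.normal(size=(19, 11)))

pca_full = PCA().fit(X)  # keep all components
cumvar = np.cumsum(pca_full.explained_variance_ratio_)

# smallest number of components explaining at least 80% of the variance
n_80 = int(np.searchsorted(cumvar, 0.80)) + 1
```

On the real data, `cumvar[1]` would reproduce the 65.94% reported above and `n_80` would confirm whether 3 components reach roughly 80%.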
# Access the loadings
pcaModel.results['loadings']
| | Population | SUPERFICIE | totalEmpresas | Pob2010 | AbstentionPtge | Izda_Pct | Age_over65_pct | WomanPopulationPtge | UnemployLess25_Ptge | UnemployMore40_Ptge | cluster_kmeans |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PC1 | -0.307732 | -0.306864 | -0.301913 | -0.310262 | 0.399468 | 0.067233 | -0.324144 | 0.181005 | 0.323635 | -0.358863 | -0.297107 |
| PC2 | 0.416770 | -0.058824 | 0.434959 | 0.415537 | 0.221417 | -0.243070 | -0.347585 | 0.366550 | 0.109908 | 0.008755 | -0.292477 |
# Biplot to interpret the components
pcaModel.biplot(legend=False)
[pca] >Plot PC1 vs PC2 with loadings.
[colourmap]> Warning: Colormap [Set1] can not create [19] unique colors! Available unique colors: [9].
[Figure: biplot of PC1 (37.6% expl. var) vs PC2 (28.3% expl. var); 2 principal components explain 65.94% of the variance]
Based on the biplot, we can interpret PC1 as speaking mainly of the abstention percentage and unemployment (directly proportional for under-25s, inversely proportional for over-40s), with surface area also showing an inverse relation (as PC1 increases, surface area decreases); PC2 speaks mainly of population figures (total population, % of women, and, inversely for the last two, % of left-wing voters and % of people over 65), along with the total number of companies.
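This reading of the components can be made systematic by ranking the variables by the absolute magnitude of their loadings. A small sketch using the values transcribed (rounded to 4 decimals) from the loadings table above:

```python
# Sketch: rank variables by |loading| on each component.
# Loadings transcribed from the table above (rounded to 4 decimals).
import pandas as pd

loadings = pd.DataFrame(
    {'PC1': [-0.3077, -0.3069, -0.3019, -0.3103, 0.3995, 0.0672,
             -0.3241, 0.1810, 0.3236, -0.3589, -0.2971],
     'PC2': [0.4168, -0.0588, 0.4350, 0.4155, 0.2214, -0.2431,
             -0.3476, 0.3666, 0.1099, 0.0088, -0.2925]},
    index=['Population', 'SUPERFICIE', 'totalEmpresas', 'Pob2010',
           'AbstentionPtge', 'Izda_Pct', 'Age_over65_pct',
           'WomanPopulationPtge', 'UnemployLess25_Ptge',
           'UnemployMore40_Ptge', 'cluster_kmeans'])

# the three strongest contributors to each component, by absolute weight
top_pc1 = loadings['PC1'].abs().sort_values(ascending=False).head(3)
top_pc2 = loadings['PC2'].abs().sort_values(ascending=False).head(3)
```

The ranking confirms the interpretation above: abstention and over-40 unemployment dominate PC1, while the size variables (totalEmpresas, Population, Pob2010) dominate PC2.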
# Access the scores
pcaModel.results['PC']
# Join the cluster_kmeans variable
elec_pca = pcaModel.results['PC'].join(elec_m['cluster_kmeans'])
elec_pca.sort_values(by = 'cluster_kmeans')
| CCAA | PC1 | PC2 | cluster_kmeans |
|---|---|---|---|
| Ceuta | 4.776327 | 0.958855 | 0 |
| Melilla | 4.864795 | 1.032096 | 0 |
| Cataluña | -2.440261 | 3.830334 | 1 |
| ComValenciana | -1.529535 | 1.747195 | 1 |
| Galicia | -0.928213 | 0.791379 | 1 |
| Madrid | -1.417782 | 2.680517 | 1 |
| Navarra | 0.693211 | -1.365749 | 2 |
| Asturias | 0.701941 | -0.875903 | 2 |
| Baleares | 1.044479 | -0.206041 | 2 |
| Canarias | 0.482153 | 0.332124 | 2 |
| Cantabria | 0.537125 | -1.204489 | 2 |
| Murcia | 1.084965 | -0.063255 | 2 |
| PaísVasco | 0.467143 | 0.271167 | 2 |
| Extremadura | 0.431680 | -1.131705 | 2 |
| Rioja | -1.366352 | -3.254432 | 3 |
| CastillaLeón | -2.564761 | -1.689988 | 3 |
| Aragón | -1.069790 | -2.202332 | 3 |
| CastillaMancha | -1.673014 | -1.795331 | 3 |
| Andalucía | -2.094109 | 2.145558 | 4 |
# Scatter of the PCA scores, colored by cluster
import plotly.express as px
fig = px.scatter(elec_pca, x='PC1',y='PC2', color='cluster_kmeans', text=elec_m.index)
fig.show()
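For environments without plotly, a static matplotlib version of the same scatter can be built from the scores; the values below are transcribed (rounded) from the table above:

```python
# Sketch: static matplotlib alternative to the plotly scatter.
# Scores transcribed (rounded) from the PC table above.
import matplotlib
matplotlib.use('Agg')  # non-interactive backend (assumption: headless use)
import matplotlib.pyplot as plt
import pandas as pd

elec_pca = pd.DataFrame(
    {'PC1': [4.776, 4.865, -2.440, -1.530, -0.928, -1.418, 0.693, 0.702,
             1.044, 0.482, 0.537, 1.085, 0.467, 0.432, -1.366, -2.565,
             -1.070, -1.673, -2.094],
     'PC2': [0.959, 1.032, 3.830, 1.747, 0.791, 2.681, -1.366, -0.876,
             -0.206, 0.332, -1.204, -0.063, 0.271, -1.132, -3.254, -1.690,
             -2.202, -1.795, 2.146],
     'cluster': [0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4]},
    index=['Ceuta', 'Melilla', 'Cataluña', 'ComValenciana', 'Galicia',
           'Madrid', 'Navarra', 'Asturias', 'Baleares', 'Canarias',
           'Cantabria', 'Murcia', 'PaísVasco', 'Extremadura', 'Rioja',
           'CastillaLeón', 'Aragón', 'CastillaMancha', 'Andalucía'])

fig, ax = plt.subplots()
ax.scatter(elec_pca['PC1'], elec_pca['PC2'],
           c=elec_pca['cluster'], cmap='tab10')
for name, row in elec_pca.iterrows():
    ax.annotate(name, (row['PC1'], row['PC2']), fontsize=8)
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
fig.savefig('clusters_pca.png')
```

The discrete `tab10` colormap avoids the `Set1` warning seen above, since it only needs 5 distinct colors here, one per cluster.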
Now we really do see well-defined groups, clearly separated from one another, although Andalucía still blends in heavily with group 1, just as it had been classified under the Ward model.
Interpreting the groups a little in relation to the principal components, we have that: